Poster: njwhite Date: Feb 11, 2014 9:08am
Forum: texts Subject: Using Tesseract to improve OCR for some languages

I've been using and improving Tesseract OCR for some time; in particular, I developed a good training file for OCR of Ancient Greek (now part of the main Tesseract distribution).

The current OCR for Ancient Greek books on archive.org is garbage, as it seems to treat the text as Latin (though ABBYY doesn't support Ancient Greek anyway).

It would be great if the OCR process used the language metadata to select the OCR engine, and in the case of Ancient Greek chose Tesseract.
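
To make the suggestion concrete, here is a minimal sketch of what selecting an engine from language metadata might look like. It is not archive.org code: the function names, the mapping, and the spellings of the metadata values are assumptions for illustration. The only parts taken from Tesseract itself are the `grc` language code for Ancient Greek and the `-l` option of the command-line tool.

```python
# Hypothetical sketch: pick an OCR engine from an item's language metadata.
# The mapping and function names are illustrative assumptions, not archive.org code.

# Metadata values (lower-cased) that should be routed to Tesseract,
# mapped to the Tesseract language code to use ("grc" is Ancient Greek).
TESSERACT_LANGS = {
    "ancient greek": "grc",
    "greek, ancient": "grc",
    "grc": "grc",
}


def choose_engine(language):
    """Return ("tesseract", code) for routed languages, else ("abbyy", None)."""
    code = TESSERACT_LANGS.get(language.strip().lower())
    if code is not None:
        return ("tesseract", code)
    return ("abbyy", None)


def tesseract_command(image_path, output_base, code):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", code]
```

With this sketch, `choose_engine("Greek, Ancient")` would route a book to Tesseract with the `grc` data, while any language not in the table falls through to the existing ABBYY path.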

Is that something that's being worked on? I'd be very happy to help however I can, if anybody is interested. I'm also very active on the tesseract-ocr mailing list if anyone wants to contact me there.

Nick White

This post was modified by njwhite on 2014-01-29 14:56:17

This post was modified by njwhite on 2014-02-11 17:08:48

Poster: Jeff Kaplan (Administrator, Curator, or Staff) Date: Jan 29, 2014 9:38pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Ancient Greek is one of the languages we can OCR using our Abbyy module.

Poster: njwhite Date: Jan 30, 2014 1:36am
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Oh really? I didn't know they had pre-built modules for that (the http://finereader.abbyy.com/corporate/tech_specs/ page doesn't mention it). Note that Ancient Greek is quite different to modern Greek (more diacritics, different vocabulary).
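
As a small illustration of the diacritics point, the sketch below counts combining marks in one word written in polytonic (Ancient Greek) and in monotonic (modern Greek) spelling. The example word and the function name are chosen here for illustration only; the code uses nothing beyond Python's standard `unicodedata` module.

```python
import unicodedata


def count_diacritics(text):
    """Count combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


# The same word in two spellings, written with escapes to avoid encoding issues.
# Polytonic: initial alpha carries a smooth breathing and an acute accent.
ancient = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"
# Monotonic: initial alpha carries only a single accent (tonos).
modern = "\u03ac\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

print(count_diacritics(ancient), count_diacritics(modern))
```

An OCR model trained only on monotonic text has no way to emit the breathing marks and extra accent forms, which is one reason a dedicated Ancient Greek training file matters.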

Is there some way you can automatically select the language to OCR? I ask because, as I mentioned, at present all the Ancient Greek books I've looked at in the archive appear to have been treated as Latin, with unusable results.

Poster: Jeff Kaplan (Administrator, Curator, or Staff) Date: Jan 30, 2014 1:32pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Yes, if you use our HTML5 uploader with Chrome, Firefox, or Safari at archive.org/upload, you will see a language cell. Click in it and the dropdown menu includes Greek, Ancient.

Poster: aibek Date: Jan 30, 2014 7:19pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

To see what Jeff meant by ‘Abbyy module’, see the attached extract from a log file.

I suggest that you upload a PDF file containing Ancient Greek exclusively, in the manner suggested by Jeff, and check what happens.

Attachment: Module-AbbyyXML.txt

Poster: tfmorris Date: Feb 11, 2014 8:56am
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Does anyone look at improving OCR quality on an ongoing basis, whether through the use of Tesseract or other means? Is OCR ever re-done after the initial pass?

I did a study recently scoring the OCR quality of public domain eBooks in IA and found the quality to be all over the map. I suspect that Tesseract could do a better job in many cases, but I also suspect that ABBYY could be improved as well.
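
For readers wondering what an automatic OCR quality score can look like, one simple approach is the fraction of words in the OCR output that appear in a word list for the language. The sketch below is only an illustration of that idea and is not necessarily the method used in the study mentioned above; the function name and the tiny vocabulary are hypothetical.

```python
import re


def dictionary_score(text, vocabulary):
    """Fraction of alphabetic tokens found in vocabulary (0.0 if no tokens)."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


# Toy vocabulary; a real run would load a full word list for the language.
vocab = {"the", "quick", "brown", "fox"}
clean_score = dictionary_score("The quick brown fox", vocab)
noisy_score = dictionary_score("Tlie qnick brown fox", vocab)
```

A score like this is crude (it penalizes proper nouns and rare words), but it is cheap enough to run across a whole collection and to flag books that might be worth re-processing.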

I saw some anecdotal evidence that high processing loads on the OCR cluster and the use of "fast mode" were correlated with low OCR quality. That seems fine as an interim measure if the books were later requeued for full processing, but that doesn't seem to happen.

Like Nick, I'd be willing to help with improving the OCR quality.

Tom